Introduction

This project looks into the NYPD Shooting Incident Data obtained from https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic. The data describes shooting incidents in New York City dating back to 2006 and provides information about exact location, time, and information about the victim and suspect.Through this analysis, we aim to identify key factors that contribute to fatal shootings and to explore how variables such as borough, race, age, and gender of both the victim and perpetrator play a role in the likelihood of a shooting incident resulting in death.

Columns

Load the Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(lubridate)
library(leaflet)
library(leaflet.extras)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(e1071)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(caTools)

Importing the data

url_in <- "https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD"
shooting_data <- read_csv(url_in)
## Rows: 28562 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (12): OCCUR_DATE, BORO, LOC_OF_OCCUR_DESC, LOC_CLASSFCTN_DESC, LOCATION...
## dbl   (7): INCIDENT_KEY, PRECINCT, JURISDICTION_CODE, X_COORD_CD, Y_COORD_CD...
## lgl   (1): STATISTICAL_MURDER_FLAG
## time  (1): OCCUR_TIME
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Tidying the data

### remove columns not needed
shooting_data_cleaned <- shooting_data %>% select(-LOCATION_DESC, -LOC_OF_OCCUR_DESC, -LOC_CLASSFCTN_DESC, -Lon_Lat)
### convert the OCCUR_DATE into a date object
shooting_data_cleaned <- shooting_data_cleaned %>% mutate(OCCUR_DATE = mdy(OCCUR_DATE)) 
### arrange from oldest to newest by OCCUR_DATE
shooting_data_cleaned <- shooting_data_cleaned %>% arrange(OCCUR_DATE)
### convert OCCUR_TIME into time format
shooting_data_cleaned$OCCUR_TIME <- hms(shooting_data_cleaned$OCCUR_TIME)
### convert categorical variables into factors
shooting_data_cleaned <- shooting_data_cleaned %>% mutate(
     BORO = factor(BORO),
     PRECINCT = factor(PRECINCT),
     JURISDICTION_CODE = factor(JURISDICTION_CODE), 
     PERP_AGE_GROUP = factor(PERP_AGE_GROUP), 
     PERP_SEX = factor(PERP_SEX), 
     PERP_RACE = factor(PERP_RACE), 
     VIC_AGE_GROUP = factor(VIC_AGE_GROUP),
     VIC_SEX = factor(VIC_SEX),
     VIC_RACE = factor(VIC_RACE))
### Get rid of NAs
shooting_data_cleaned <- shooting_data_cleaned %>% filter(
  !is.na(JURISDICTION_CODE),
  !is.na(PERP_SEX),
  !is.na(PERP_AGE_GROUP),
  !is.na(Longitude),
  !is.na(Latitude))
### combine unknown and null values in column
shooting_data_cleaned <- shooting_data_cleaned %>%
     mutate(PERP_AGE_GROUP = recode(PERP_AGE_GROUP,
                                    "UNKNOWN" = "Unknown",
                                    "(null)" = "Unknown"))
shooting_data_cleaned <- shooting_data_cleaned %>%
     mutate(PERP_RACE = recode(PERP_RACE,
                                    "UNKNOWN" = "Unknown",
                                    "(null)" = "Unknown"))
shooting_data_cleaned <- shooting_data_cleaned %>%
     mutate(PERP_SEX = recode(PERP_SEX,
                               "U" = "Unknown",
                               "(null)" = "Unknown"))
###recode "U"/"UNKNOWN" to be "Unknown" for consistancy
shooting_data_cleaned <- shooting_data_cleaned %>%
     mutate(VIC_SEX = recode(VIC_SEX,
                              "U" = "Unknown"))
shooting_data_cleaned <- shooting_data_cleaned %>%
     mutate(VIC_AGE_GROUP = recode(VIC_AGE_GROUP,
                             "UNKNOWN" = "Unknown"))
shooting_data_cleaned <- shooting_data_cleaned %>%
     mutate(VIC_RACE = recode(VIC_RACE,
                                   "UNKNOWN" = "Unknown"))
### locate unique values and clean up
unique(shooting_data_cleaned$PERP_AGE_GROUP)
## [1] 25-44   18-24   Unknown 45-64   <18     65+     224     940     1020   
## Levels: Unknown <18 1020 1028 18-24 224 25-44 45-64 65+ 940
shooting_data_cleaned <- shooting_data_cleaned %>%
  filter(!PERP_AGE_GROUP %in% c("224", "940", "1020", "1028")) %>% 
  droplevels() #remove empty factor level

unique(shooting_data_cleaned$VIC_AGE_GROUP)
## [1] 25-44   <18     18-24   45-64   65+     Unknown 1022   
## Levels: <18 1022 18-24 25-44 45-64 65+ Unknown
shooting_data_cleaned <- shooting_data_cleaned %>%
     filter(!VIC_AGE_GROUP %in% c("1022")) %>% 
     droplevels()

### obtain a summary
summary(shooting_data_cleaned)
##   INCIDENT_KEY         OCCUR_DATE           OCCUR_TIME                       
##  Min.   :  9953245   Min.   :2006-01-01   Min.   :0S                         
##  1st Qu.: 51724135   1st Qu.:2008-09-27   1st Qu.:3H 52M 0S                  
##  Median : 86012377   Median :2012-07-30   Median :15H 25M 0S                 
##  Mean   :122468275   Mean   :2013-12-27   Mean   :12H 59M 41.4232757270365S  
##  3rd Qu.:205280615   3rd Qu.:2019-11-17   3rd Qu.:20H 37M 0S                 
##  Max.   :279758069   Max.   :2023-12-29   Max.   :23H 59M 0S                 
##                                                                              
##             BORO         PRECINCT     JURISDICTION_CODE STATISTICAL_MURDER_FLAG
##  BRONX        :5849   75     : 1070   0:16200           Mode :logical          
##  BROOKLYN     :7039   73     :  913   1:   67           FALSE:15337            
##  MANHATTAN    :2714   44     :  746   2: 2886           TRUE :3816             
##  QUEENS       :2896   47     :  746                                            
##  STATEN ISLAND: 655   46     :  724                                            
##                       67     :  644                                            
##                       (Other):14310                                            
##  PERP_AGE_GROUP    PERP_SEX                              PERP_RACE    
##  Unknown:4262   Unknown: 2580   Unknown                       : 2918  
##  <18    :1673   F      :  443   AMERICAN INDIAN/ALASKAN NATIVE:    2  
##  18-24  :6424   M      :16130   ASIAN / PACIFIC ISLANDER      :  169  
##  25-44  :6032                   BLACK                         :11878  
##  45-64  : 697                   BLACK HISPANIC                : 1388  
##  65+    :  65                   WHITE                         :  298  
##                                 WHITE HISPANIC                : 2500  
##  VIC_AGE_GROUP     VIC_SEX                                VIC_RACE    
##  <18    :2134   F      : 2060   AMERICAN INDIAN/ALASKAN NATIVE:    9  
##  18-24  :6793   M      :17084   ASIAN / PACIFIC ISLANDER      :  343  
##  25-44  :8601   Unknown:    9   BLACK                         :13012  
##  45-64  :1405                   BLACK HISPANIC                : 1941  
##  65+    : 161                   Unknown                       :   52  
##  Unknown:  59                   WHITE                         :  582  
##                                 WHITE HISPANIC                : 3214  
##    X_COORD_CD        Y_COORD_CD        Latitude       Longitude     
##  Min.   : 920263   Min.   :127539   Min.   :40.52   Min.   :-74.23  
##  1st Qu.: 999876   1st Qu.:183445   1st Qu.:40.67   1st Qu.:-73.94  
##  Median :1007969   Median :197702   Median :40.71   Median :-73.91  
##  Mean   :1009049   Mean   :209630   Mean   :40.74   Mean   :-73.91  
##  3rd Qu.:1017042   3rd Qu.:240747   3rd Qu.:40.83   3rd Qu.:-73.88  
##  Max.   :1065474   Max.   :271128   Max.   :40.91   Max.   :-73.71  
## 
### The Unknowns in the applicable columns make sense for that category

Transformations for visualizations

# Get Yearly data
yearly_data_for_line_plot <- shooting_data_cleaned %>%
     mutate(Year = year(OCCUR_DATE)) %>%
     group_by(Year) %>%
     summarise(Incidents = n())


# What time of day or week do most shooting incidents occur?
times <- shooting_data_cleaned %>%
    mutate(
        Year = year(OCCUR_DATE),
        Month = month(OCCUR_DATE),
        Day = day(OCCUR_DATE),
        Hour = hour(OCCUR_TIME),
        Weekday = wday(OCCUR_DATE, label = TRUE)
    )
# Which Boroughs have the highest shooting incidents?
shooting_count_by_boro <- shooting_data_cleaned %>%
    group_by(BORO) %>%               
    summarise(Incidents = n()) %>%    
    arrange(desc(Incidents))

Visualizing the data

Line Chart 1: Line Chart of shooting incidents by Hour of Day

incidents_by_hour <- times %>%
     group_by(Hour) %>%
     summarise(Incidents = n()) %>% 
     ggplot(aes(x = Hour, y = Incidents)) +
     geom_line(color = "pink", linewidth = 1) +
     labs(
         title = "Shooting Incidents by Hour of the Day",
         x = "Hour of the Day",
         y = "Number of Incidents"
     ) +
     theme_minimal()
ggplotly(incidents_by_hour)

Line Chart 2: Line Chart of shooting Incidents over Years

ggplot(yearly_data_for_line_plot, aes(x = Year, y = Incidents)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "red", size = 2) +
  labs(title = "Yearly Shooting Incidents", x = "Year", y = "Number of Incidents") +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Analysis for shooting incidents by time

Line Chart 1 shows that most shooting incidents occurred in the cover of night either the early hours of the morning or later times in the day. As the sun sets, the number of shooting incidents seem to increase.

Line Chart 2 shows a decline in shooting incidents over years of 2006 to 2019 and then a sudden rise starting from 2020 to 2022 before starting to decrease again. There are some notable events that may have contributed to the increase in the number of shootings from 2019 to 2022 in New York such as the George Floyd protests as well as changes to laws that aimed to decrease the population of inmates in jail leading to repeat offenders being in the population. The sudden decrease in shooting incidents in 2023 could be the result of New York’s efforts to decrease the number of firearms in high-crime neighborhoods.

Bar Chart for shooting incidents by borough

shooting_by_boro_bar <- shooting_count_by_boro %>%
    ggplot(aes(x = reorder(BORO, Incidents), y = Incidents, fill = BORO)) + 
    geom_bar(stat = "identity", show.legend = FALSE) +     
    labs(
        title = "Number of Shooting Incidents by Borough",
        x = "Borough",
        y = "Number of Shooting Incidents"
    ) +
    theme_dark() +
    coord_flip()  
shooting_by_boro_bar

Heatmap of shooting locations

leaflet(shooting_data_cleaned) %>%
  addTiles() %>%  
  addHeatmap(~Longitude, ~Latitude, blur = 5, max = 1, radius = 11) %>%
  setView(lng = mean(shooting_data_cleaned$Longitude, na.rm = TRUE),  
          lat = mean(shooting_data_cleaned$Latitude, na.rm = TRUE),   
          zoom = 11)

Analysis shooting incidents by borough

There are shooting incidents all over New York, however the heatmap shows that the highest concentration of shootings occur in certain boroughs like the Bronx and Brooklyn neighborhoods. These boroughs were identified using the bar chart which displays the highest shooting incidents by borough. However, a heatmap offered a more visually pleasing view of exactly which areas of New York had the highest shooting incidents as viewers are able to see it spatially displayed on a map.

Modeling the data

#RandomForest to predict death vs. non-death
set.seed(555)

#split data into training and test data set
split <- sample.split(shooting_data_cleaned$STATISTICAL_MURDER_FLAG, SplitRatio = 0.8)

data_train <- subset(shooting_data_cleaned, split == TRUE)
data_test <- subset(shooting_data_cleaned, split == FALSE)

data_train$STATISTICAL_MURDER_FLAG <- as.factor(data_train$STATISTICAL_MURDER_FLAG) #ensure STATISTICAL_MURDER_FLAG is a factor
data_test$STATISTICAL_MURDER_FLAG <- as.factor(data_test$STATISTICAL_MURDER_FLAG)


#train model with fixed seed so easy to reproduce
rf_model <- randomForest(
  STATISTICAL_MURDER_FLAG ~ BORO + PERP_RACE + PERP_AGE_GROUP + VIC_RACE + VIC_AGE_GROUP + 
    PERP_SEX + VIC_SEX,
  data = data_train,
  importance = TRUE,
  ntree = 1500, # increase number of trees to 
  nodesize = 10,
  classwt = c(0.54, .46), # adjust weight because there are more non deaths than deaths to increase the accuracy of the model 
  keep.forest = TRUE,
  do.trace = 100 #show every 100 OOB value
)
## ntree      OOB      1      2
##   100:  41.75% 42.22% 39.83%
##   200:  42.07% 42.70% 39.53%
##   300:  42.41% 43.41% 38.36%
##   400:  42.29% 43.43% 37.70%
##   500:  42.37% 43.51% 37.80%
##   600:  42.30% 43.38% 37.93%
##   700:  42.19% 43.29% 37.77%
##   800:  42.00% 43.04% 37.80%
##   900:  42.05% 43.17% 37.54%
##  1000:  42.17% 43.22% 37.96%
##  1100:  42.07% 43.19% 37.57%
##  1200:  42.11% 43.15% 37.96%
##  1300:  42.07% 43.11% 37.93%
##  1400:  42.11% 43.21% 37.70%
##  1500:  42.15% 43.25% 37.73%
print(rf_model)
## 
## Call:
##  randomForest(formula = STATISTICAL_MURDER_FLAG ~ BORO + PERP_RACE +      PERP_AGE_GROUP + VIC_RACE + VIC_AGE_GROUP + PERP_SEX + VIC_SEX,      data = data_train, importance = TRUE, ntree = 1500, nodesize = 10,      classwt = c(0.54, 0.46), keep.forest = TRUE, do.trace = 100) 
##                Type of random forest: classification
##                      Number of trees: 1500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 42.15%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE  6963 5307   0.4325183
## TRUE   1152 1901   0.3773338
plot(rf_model)

#make predictions on the test data
rf_predictions <- predict(rf_model, newdata = data_test)

#plot prediction bar chart
 # Combine predictions with actual values into a data frame
comparison_df <- data.frame(
  Actual = data_test$STATISTICAL_MURDER_FLAG,
  Predicted = rf_predictions
)

# Create a bar plot of predicted vs actual values
ggplot(comparison_df, aes(x = Actual, fill = Predicted)) +
  geom_bar(position = "fill") +
  labs(title = "Predicted vs Actual Proportions", y = "Proportion") +
  scale_fill_manual(values = c("pink", "purple")) +
  theme_minimal()

#get confusion matrix
conf_matrix <- table(Predicted = rf_predictions, Actual = data_test$STATISTICAL_MURDER_FLAG)
print(conf_matrix)
##          Actual
## Predicted FALSE TRUE
##     FALSE  1805  287
##     TRUE   1262  476
#plot the factor importance
importance(rf_model)
##                     FALSE       TRUE MeanDecreaseAccuracy MeanDecreaseGini
## BORO             4.566203   8.332650             8.804379        123.42041
## PERP_RACE      -96.894825 112.149964           -76.355233        197.18183
## PERP_AGE_GROUP -87.898502 201.773839           -17.319437        506.95139
## VIC_RACE         1.315969  18.425863             9.791114        116.59023
## VIC_AGE_GROUP   10.374924  21.889063            21.009754        143.35430
## PERP_SEX       -49.833382  47.385749           -48.613113         89.13826
## VIC_SEX         -9.693797   4.963822            -6.526358         43.30935
varImpPlot(rf_model)

Analysis for model

A random forest model was used because the STATISTICAL_MURDER_FLAG results in False or True values that were turned into factors and decision trees in the random forest model are good at working with categorical variables in a large dataset. From the model summary, we can see that the error for data that was not part of the training data was 42.15% (OOB). This is quite high so the model is not very good at predicting if a shooting incident resulted in a death as supported by the confusion matrix. The bar chart shows shows the predicted vs the actual proportions of the deaths but the model could be improved. Although efforts were made to improve the model, because of the high amount of non-deaths in the data set, the model became really good at predicting non-deaths but resulted in extremely high errors for predicting deaths.

Based on feature importance, we can see that Boro is the most important feature to predicting whether a shooting incident results in a death. Through previous visualizations, we saw how the Bronx and Brooklyn neighborhoods had high occurrances of shooting incidents. Perhaps with more research we are able to see if being involved in a shooting incident in one of these neighborhoods is more predictive of a death.

The other important feature is the Victim’s Age Group. It is possible that the younger that a person is, the higher chance that a shooting incident could result in death. This could be confirmed with further analysis.

Conclusion / Bias Indentification

This analysis of the NYPD Shooting Incident Data shows that although there was a decline in shooting incidents since 2006 and then a rise before declining again, there are borughs where shooting incidents are more concentrated than in other boroughs. These findings suggest that there could be more strategies to mitigate the number of shooting incidents by focusing on the high concentration areas. The findings also suggest that by creating an improved model for factors contributing to the STATISTICAL_MURDER_FLAG could lead to more support, funding, and education in certain communities to prevent deaths.

A personal bias that may have influenced my analysis is that I have viewed all of New York has a high crime area and therefore was not surprised when the map showed much of New York covered in shooting incidents. I mitigated for this by creating a heatmap that identifies the most concentrated areas of shooting incidents instead of just having the map of New York with all the shooting incidents marked on it.